ABSTRACT

We describe a parallel implementation of a chart parser for a shared-memory
multiprocessor. The speed-ups obtained with this parser have been measured for a number of
small natural-language grammars. For the largest of these, part of an operational
question-answering system, the parser ran 5 to 7 times faster than the serial version.


1.  Introduction 

We report here on a series of experiments to determine whether the parsing
component of a natural language analyzer can be easily converted to a parallel program
which provides significant speed-up over the serial program.

These  experiments  were  prompted  in  part  by  the  rapidly  growing  availability  of 
parallel  processor  systems.  Parsing  remains  a  relatively  time-consuming  component  of 
language  analysis  systems.  This  is  particularly  so  if  constraints  are  being  systematically 
relaxed  in  order  to  handle  ill-formed  input  (as  suggested,  for  example,  in  [Weischedel 
and  Sondheimer  1983])  or  if  there  is  uncertainty  regarding  the  input  (as  is  the  case  for 
speech  input,  for  example).  This  time  could  be  reduced  if  we  can  take  advantage  of  the 
new parallel architectures. Such a parallel parser could be combined with parallel
implementations of other components (the acoustic component of a speech system, for
example) to improve overall system performance.

2.  Background 

There  have  been  a  number  of  theoretical  and  algorithmic  studies  of  parallel  parsing, 
beginning  well  before  the  current  availability  of  suitable  experimental  facilities. 

For general context-free grammars, it is possible to adapt the Cocke-Younger-Kasami
(CYK) algorithm [Aho and Ullman 1972, p. 314 ff] for parallel use. This algorithm,
which takes time proportional to n^3 (n = length of input string) on a single processor, can
operate in time n using n^2 processors. The matrix form of this algorithm is well suited to
large arrays of synchronous processors. The algorithm we describe below is basically a
CYK parser with top-down filtering¹, but the main control structure is an event queue
rather than iteration over a matrix. Because the CYK matrix is large and typically
sparse², we felt that the event-driven algorithm would be more efficient in our
environment of a small number of asynchronous processors (<< n^2 for our longest sentences) and
grammars augmented by conditions which must be checked on each rule application and
which vary widely in compute time.
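
For comparison with the event-driven parser described later, the matrix-style control structure of CYK can be sketched as follows. This is a minimal Python illustration, not code from our system: it assumes a grammar in Chomsky normal form supplied as two hypothetical lists, `unary` (category, word) and `binary` (parent, left, right), and it only recognizes a sentence rather than building parses.

```python
def cyk_recognize(words, unary, binary, start="S"):
    """Return True if `words` can be derived from `start`.

    unary  : list of (category, word) pairs      (assumed input format)
    binary : list of (parent, left, right) rules (assumed input format)
    """
    n = len(words)
    # table[i][j] holds the categories that span nodes i..j (words i+1..j).
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        table[i][i + 1] = {cat for cat, w in unary if w == word}
    # Three nested loops over span length, start node and split point: O(n^3).
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, left, right in binary:
                    if left in table[i][k] and right in table[k][j]:
                        table[i][j].add(parent)
    return start in table[0][n]


unary = [("A", "a"), ("B", "b")]
binary = [("S", "A", "B")]
print(cyk_recognize(["a", "b"], unary, binary))
```

All cells with the same span length depend only on shorter spans, so they are independent of one another; this is the property a synchronous processor array can exploit. The loops also visit every cell whether or not it is ever filled, which is the cost the event-driven design avoids when the matrix is sparse.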

[Cohen  et  al.  1982]  present  a  general  upper  bound  for  speed-up  in  parallel  parsing, 
based  on  the  number  of  processors  and  properties  of  the  grammar.  Their  more  detailed 
analysis,  and  the  subsequent  work  of  Sarkar  and  Deo  [1985]  focus  on  algorithms  and 
speed-ups for parallel parsing of deterministic context-free grammars. Most
programming language grammars are deterministic, but most natural language grammars are not,
so this work (based on shift-reduce parsers) does not seem directly applicable.

Experimental  data  involving  actual  implementations  is  more  limited.  Extensive 
measurements  were  made  on  a  parallel  version  of  the  Hearsay-II  speech  understanding 
system  [Fennel  and  Lesser  1977].  However,  the  syntactic  analysis  was  only  one  of  many 
knowledge  sources,  so  it  is  difficult  to  make  any  direct  comparison  between  their  results 
and those presented here. Bolt Beranek and Newman is currently conducting
experiments with a parallel parser quite similar to those described below [Haas 1987]. BBN
uses  a  unification  grammar  in  place  of  the  procedural  restrictions  of  our  system.  At  the 
time  of  this  writing,  we  do  not  yet  have  detailed  results  from  BBN  to  compare  to  our 
own. 

3.  Environment 

Our  programs  were  developed  for  the  NYU  Ultracomputer  [Gottlieb  et  al.,  1983],  a 
shared-memory  MIMD  parallel  processor  with  a  special  instruction,  fetch-and-add,  for 
processor  synchronization.  The  programs  should  be  easily  adaptable  to  any  similar 
shared  memory  architecture. 

The  programs  have  been  written  in  ZLISP,  a  version  of  LISP  for  the  Ultracomputer 
which  has  been  developed  by  Isaac  Dimitrovsky.  Both  an  interpreter  and  a  compiler  are 
available. ZLISP supports several independent processes, and provides both global
variables (shared by all processes) and variables which are local to each process. Our
programs have used low-level synchronization operations, which directly access the
fetch-and-add primitive. More recent versions of ZLISP also support higher-level
synchronization primitives and data structures such as parallel queues and parallel stacks.

4.  Algorithms 

Our parser is intended as part of the PROTEUS system [Ksiezyk et al. 1987].
PROTEUS uses augmented context-free grammars: context-free grammars augmented by
procedural restrictions which enforce syntactic and semantic constraints.

¹ We also differ from CYK in that we do not merge different analyses of the same string as the same symbol. As a result, our
procedure would not operate in linear time for general (ambiguous) grammars.

² For grammar #4 given below and a 15-word sentence, the matrix would have roughly 15,000 entries, of which only about
1000 entries are filled.

The basic parsing algorithm we use is a chart parser [Thompson 1981, Thompson
and Ritchie 1984]. Its basic data structure, the chart, consists of nodes and edges. For an
n-word sentence, there are n+1 nodes, numbered 0 to n. These nodes are connected by
active and inactive edges which record the state of the parsing process. If A -> W X Y Z
is a production, an active edge from node n1 to n2 labeled by A -> W X . Y Z indicates
that the symbols W X of this production have been matched to words n1+1 through n2 of
the sentence. An inactive edge from n1 to n2 labeled by a category Y indicates that words
n1+1 through n2 have been analyzed as a constituent of type Y. The "fundamental rule"
for extending an active edge states that if we have an active edge A -> W X . Y Z from n1
to n2 and an inactive edge of category Y from n2 to n3, we can build a new active edge
A -> W X Y . Z from n1 to n3. If we also have an inactive edge of type Z from n3 to n4, we
can then extend once more, creating this time an inactive edge of type A (corresponding
to a completed production) from n1 to n4.

If we have an active edge A -> W X . Y Z from n1 to n2, and this is the first time we
have tried to match symbol Y starting at n2 (there are no edges labeled Y originating at
n2), we perform a seek on symbol Y at n2: we create an active edge for each production
which expands Y, and try to extend these edges. In this way we generate any and all
analyses for Y starting at n2. This process of seeks and extends forms the core of the parser.
We begin by doing a seek for the sentence symbol S starting at node 0. Each inactive
edge which we finally create for S from node 0 to node n corresponds to a parse of the
sentence.

The serial (uniprocessor) procedure uses a task queue called an agenda. Whenever
a  seek  is  required  during  the  process  of  extending  an  edge,  an  entry  is  made  on  the 
agenda.  When  we  can  extend  the  edge  no  further,  we  go  to  the  agenda,  pick  up  a  seek 
task,  create  the  corresponding  active  edge  and  then  try  to  extend  it  (possibly  giving  rise 
to  more  seeks).  This  process  continues  until  the  agenda  is  empty. 
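
The seek/extend cycle and the agenda can be illustrated with a minimal Python sketch. It is not the ZLISP code of our parser, and several choices are assumptions made for brevity: the grammar is a dictionary from nonterminals to lists of right-hand sides, each word is entered as an inactive edge, restrictions are omitted, and the chart is a set, so identical edges are merged and the sketch acts as a recognizer rather than enumerating every parse. The names `chart_parse`, `add` and `sought` are illustrative.

```python
from collections import deque


def chart_parse(grammar, start, words):
    """Return (accepted, chart). An edge is (i, j, lhs, rhs, dot)."""
    n = len(words)
    chart = set()      # all edges built so far
    sought = set()     # (symbol, node) pairs for which a seek was requested
    agenda = deque()   # pending seek tasks

    def add(edge):
        if edge in chart:
            return
        chart.add(edge)
        i, j, lhs, rhs, dot = edge
        if dot == len(rhs):
            # Inactive edge: extend every active edge ending at i that needs lhs.
            for ai, aj, alhs, arhs, adot in list(chart):
                if aj == i and adot < len(arhs) and arhs[adot] == lhs:
                    add((ai, j, alhs, arhs, adot + 1))
        else:
            needed = rhs[dot]
            if (needed, j) not in sought:      # first attempt at this node
                sought.add((needed, j))
                agenda.append((needed, j))
            # Fundamental rule against inactive edges already in the chart.
            for bi, bj, blhs, brhs, bdot in list(chart):
                if bi == j and bdot == len(brhs) and blhs == needed:
                    add((i, bj, lhs, rhs, dot + 1))

    for i, word in enumerate(words):
        add((i, i + 1, word, (), 0))           # word as an inactive edge
    sought.add((start, 0))
    agenda.append((start, 0))
    while agenda:                              # run until the agenda is empty
        symbol, node = agenda.popleft()
        for rhs in grammar.get(symbol, []):    # one active edge per production
            add((node, node, symbol, tuple(rhs), 0))
    accepted = any((0, n, start, tuple(rhs), len(rhs)) in chart
                   for rhs in grammar.get(start, []))
    return accepted, chart


toy = {"S": [["NP", "VP"]], "NP": [["she"]], "VP": [["runs"]]}
print(chart_parse(toy, "S", ["she", "runs"])[0])
```

Because a seek is requested only for a symbol that some active edge needs at a given node, the sketch also shows where top-down filtering enters: no edges are started for categories that cannot continue an existing analysis.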

Our initial parallel implementation was straightforward: a set of processors all
execute the main loop of the serial program (get task from agenda / create edge / extend
edge), all operating from a single shared agenda. Thus the basic unit of computation
being scheduled is a seek, along with all the associated edge extensions. If there are
many different ways of extending an edge (using the edges currently in the chart) this
may involve substantial computation. We therefore developed a second version of the
parser with finer-grained parallelism, in which each step of extending an active edge
is treated as a separate task which is placed on the agenda. We present some
comparisons of these two algorithms below.

There was one complication which arose in the parallel implementations: a race
condition in the application of the "fundamental rule". Suppose processor P1 is adding an
active edge to the chart from node n1 to n2 with the label A -> W X . Y Z and, at the same
time, processor P2 is adding an inactive edge from node n2 to n3 with the label Y. Each
processor, when it is finished adding its edge, will check the chart for possible
application of the fundamental rule involving that edge. P1 finds the inactive edge
needed to further extend the active edge it just created; similarly, P2 finds the active edge
which can be extended using the inactive edge it just created. Both processors therefore
end up trying to extend the edge A -> W X . Y Z, and we create duplicate copies of the
extended edge A -> W X Y . Z. This race condition can be avoided by assigning a unique
(monotonically increasing) number to each edge and by applying the fundamental rule
only if the edge in the chart is older (has a smaller number) than the edge just added by
the processor.
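
The effect of the edge-numbering rule can be seen in a small deterministic simulation. This is a hedged Python sketch, not our implementation: the two "processors" are simulated sequentially by inserting both edges before either one checks the chart, `itertools.count` stands in for a fetch-and-add on a shared counter, and the function and variable names are illustrative.

```python
import itertools


def simulate(use_age_rule):
    """Replay the interleaving in which P1 and P2 both insert, then both check."""
    counter = itertools.count(1)   # stand-in for fetch-and-add on a shared counter
    chart = []                     # edges: (number, start, end, lhs, rhs, dot)
    built = []                     # edges produced by the fundamental rule

    def insert(start, end, lhs, rhs, dot):
        edge = (next(counter), start, end, lhs, rhs, dot)
        chart.append(edge)
        return edge

    def apply_rule(own):
        for other in chart:
            if other is own:
                continue
            if use_age_rule and other[0] > own[0]:
                continue           # partner is newer: its owner will extend
            for active, inactive in ((own, other), (other, own)):
                _, i, j, lhs, rhs, dot = active
                _, k, m, cat, crhs, cdot = inactive
                if (dot < len(rhs) and cdot == len(crhs)
                        and j == k and rhs[dot] == cat):
                    built.append((i, m, lhs, rhs, dot + 1))

    # Both processors finish inserting before either one checks the chart.
    e1 = insert(1, 2, "A", ("W", "X", "Y", "Z"), 2)  # P1: A -> W X . Y Z
    e2 = insert(2, 3, "Y", (), 0)                     # P2: inactive edge Y
    apply_rule(e1)
    apply_rule(e2)
    return built


print(len(simulate(use_age_rule=False)))  # both processors extend the edge
print(len(simulate(use_age_rule=True)))   # only the owner of the newer edge does
```

With the rule enabled, the pair is handled exactly once, by whichever processor received the larger number, regardless of the order in which the two checks run.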

As we noted above, the context-free grammars are augmented by procedural
restrictions. These restrictions are coded in PROTEUS Restriction Language and then
compiled into LISP. A restriction either succeeds or fails, and in addition may assign
features  to  the  edge  currently  being  built.  Restrictions  may  examine  the  substructure 
through  which  an  edge  was  built  up  from  other  edges,  and  can  test  for  features  on  these 
constituent  edges.  There  is  no  dependence  on  implicit  context  (e.g.,  variables  set  by 
another  restriction).  As  a  result,  the  restrictions  impose  no  complications  on  the  parallel 
scheduling;  they  are  simply  invoked  as  part  of  the  process  of  extending  an  edge. 

5.  Grammars 

These  algorithms  were  tested  on  four  grammars: 

(1)  A  "benchmark"  grammar: 

S -> X X X X X X X X X X X X

X -> "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j"

(2)  A very small English grammar, taken from [Grishman 1986] and used for
teaching purposes. It has 23 nonterminal symbols and 38 productions.

(3)  Grammar  #2,  with  four  restrictions  added. 

(4)  The  grammar  for  the  PROTEUS  question-answering  system,  which  includes 
yes-no and wh- questions, relative and reduced relative clauses. It has 35
nonterminal symbols and 77 productions.


6.  Method 

The programs were run in two ways: on a prototype parallel processor, and in
simulated parallel mode on a standard uniprocessor (the uniprocessor version of ZLISP
provides for relatively efficient simulation of multiple concurrent processes). The runs on
our prototype multiprocessor, the NYU Ultracomputer, were limited by the size of the
machine to 8 processors. Since we found that we could sometimes make effective use of
larger numbers of processors, most of our data was collected on the simulated parallel
system. For small numbers of processors (1-4) we had good agreement (within 10%,
usually within 2%) between the speed-ups obtained on the Ultracomputer and under
simulation.*


* For larger numbers of processors (5-8) the speed-up with the Ultracomputer was consistently below that with the simulator.
This was due, we believe, to memory contention in the Ultracomputer. This contention is a property of the current bus-based
prototype and would be greatly reduced in a machine using the target, network-based architecture.

7.  Results

We consider first the results for the test grammar, #1, analyzing the sentence

j j j j j j j j j j j j

This grammar is so simple that we can readily visualize the operation of the parser and
predict the general shape of the speed-up curve. At each token of the sentence, there are
10 productions which can expand X, so 10 seek tasks are added to the agenda. If 10
processors are available, all 10 tasks can be executed in parallel. Additional processors
produce no further speed-up; having fewer processors requires some processors to perform
several tasks, reducing the speed-up. This general behavior is borne out by the curve
shown in Figure 1. Note that because the successful seek (for the production X -> "j")
leads to the creation of an inactive edge for X and extension of the active edge for S, and
these operations must be performed serially, the maximal parallelism is much less than
10.
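
The predicted shape of this curve can be captured in a back-of-the-envelope model. The Python sketch below is an idealization, not a measurement: it assumes that all seek tasks at a token cost the same, that scheduling is free, and that any serial work per token is a single lump `serial_cost`; these parameters and the function name are hypothetical.

```python
import math


def ideal_speedup(num_tasks, processors, task_cost=1.0, serial_cost=0.0):
    """Idealized per-token speed-up for `num_tasks` equal, independent seeks."""
    # Tasks are handed out in rounds; each round runs up to `processors` tasks.
    rounds = math.ceil(num_tasks / processors)
    serial_time = serial_cost + num_tasks * task_cost
    parallel_time = serial_cost + rounds * task_cost
    return serial_time / parallel_time


for p in (1, 2, 5, 10, 20):
    print(p, ideal_speedup(10, p))
```

In this model the speed-up rises in steps as processors are added, stops growing once there are 10 processors, and falls well below 10 as soon as `serial_cost` is greater than zero, which matches the qualitative behavior described above.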

The next two figures compare the effectiveness of the two algorithms: the one with
coarse-grained parallelism (only seeks as separate tasks) and the other with finer-grained
parallelism (each seek and extend as a separate task). The finer-grained algorithm is able to
make use of more parallelism in situations where an edge can be extended in several
different ways. On the other hand, it will have more scheduling overhead, since each
extend operation has to be entered on and removed from the agenda. We therefore can
expect the finer-grained algorithm to do better on more complex sentences, for which
many different extensions of an active edge will be possible. We also expect the
finer-grained algorithm to do better on grammars with restrictions, since the evaluation of the
restriction substantially increases the time required to extend an edge, and so reduces in
proportion the fraction of time devoted to the scheduling overhead. The expectations are
confirmed by the results shown in Figures 2 and 3. Figure 2, which shows the results
using a short sentence and grammar #2 (without restrictions), shows that neither
algorithm obtains substantial speed-up and that the fine-grained algorithm is in fact slightly
worse. Figure 3, which shows the results using a long sentence and grammar #3 (with
restrictions), shows that the fine-grained algorithm is performing much better.

The  remaining  three  figures  show  speed-up  results  for  the  fine-grained  algorithms 
for  grammars  2,  3,  and  4.  For  each  figure  we  show  the  speed-up  for  three  sentences:  a 
very  short  sentence  (2-3  words),  an  intermediate  one,  and  a  long  sentence  (14-15  words). 
In all cases the graphs plot the number of processors vs. the true speed-up, that is, the speed-up
relative to the serial version of the parser. The value for 1 processor is therefore below 1,
reflecting  the  overhead  in  the  parallel  version  for  enforcing  mutual  exclusion  in  access  to 
shared  data  and  for  scheduling  extend  tasks. 

Grammars  2  and  3  are  relatively  small  (38  productions)  and  have  few  constraints,  in 
particular  on  adjunct  placement.  For  short  sentences  these  grammars  therefore  yield  a 
chart  with  few  edges  and  little  opportunity  for  parallelism.  For  longer  sentences  with 
several  adjuncts,  on  the  other  hand,  these  grammars  produce  lots  of  parses  and  hence 
offer much greater opportunity for parallelism. Grammar 4 is larger (77 productions) and
provides for a wide variety of sentence types (declarative, imperative, wh-question,
yes-no-question), but also has tighter constraints, including constraints on adjunct placement.
The number of edges in the chart and the opportunity for parallelism are therefore fairly
large for short sentences, but grow more slowly for longer sentences than with grammars
2 and 3.

These  differences  in  grammars  are  reflected  in  the  results  shown  in  Figures  4-6.  For 
the small grammar without restrictions (grammar #2), the scheduling overhead for
fine-grained parallelism largely defeats the benefits of parallelism, and the overall speed-up is
small  (Figure  4).  For  the  same  grammar  with  restrictions  (grammar  #3),  the  effect  of  the 
scheduling  overhead  is  reduced,  as  we  explained  above.  The  speed-up  is  modest  for  the 
short  sentences,  but  high  (15)  for  the  long  sentence  with  15  parses  (Figure  5).  For  the 
question-answering  grammar  (grammar  #4),  the  speed-up  is  fairly  consistent  for  short 
and  long  sentences  (Figure  6). 

8.  Discussion 

Through  relatively  small  changes  to  an  existing  serial  chart  parser,  we  have  been 
able  to  construct  an  effective  parallel  parsing  procedure  for  natural  language  grammars. 
For  our  largest  grammar  (#4),  we  obtained  consistent  speed-ups  in  the  range  of  5-7. 
Grammars  for  more  complex  applications,  and  those  allowing  for  ill-formed  input,  will 
be  considerably  larger  and  we  can  expect  higher  speed-ups. 

One issue which should be re-examined in the parallel environment is the
effectiveness of top-down filtering. This filtering, which is relatively inexpensive, blocks the
construction of a substantial number of edges and so is generally beneficial in a serial
implementation. In a parallel environment, however, the filtering enforces a left-to-right
sequencing and so reduces the opportunities for parallelism. We intend in the near future
to try a version of our algorithm without top-down filtering in order to determine the
balance between these two effects.

Figure 2. Speed-up (relative to serial parser) for grammar #2 (small grammar without
restrictions) on a 3-word sentence for the coarse-grained algorithm ("SEMP") and the
fine-grained algorithm ("SEMMOD").


Figure 5. Speed-up (relative to serial parser) for grammar #3 (small grammar with
restrictions) using the fine-grained algorithm for three sentences: a 14-word sentence (curve
1), a 5-word sentence (curve 2), and a 3-word sentence (curve 3).



